\chapter{Conclusions} \label{chap:conclusions}
The 'sigle implementation' of CUDA Keccak did not achieve effective performance improvements if compared to the CPU reference version of the algorithm.\\
This result was not surprising because of several reasons:
\begin{description}
\item [Intrinsic Sequentiality of Keccak] Due to the fact that each step of the algorithm needs the results of the previous one before starting, only the operations belonging to the current step can be actually executed in parallel. This situation leads to a low exploitation of the GPU resources. As described in Section \ref{chap:design}, only 25 threads are used, and furthermore this number does not scale with the capabilities of the GPU device.
\item [Arithmetic Operations] Some operations required by the Keccak algorithm, like SHIFT-64 or bit to bit XOR, reduce the performance in terms of instructions per seconds. Trying to avoid the use of those kind of operations in the algorithm implementation is equivalent to rewriting the algorithm itself. Resuming, a few threads performing rather slow operations leads to an under-exploitation of the possibilities offered by the Cuda Framework.
\end{description}
The 'multi implementation' instead has actually obtained a significant speed-up on each GPU tested \dots
As expected, the devices with a compute capability lower than 1.3 suffered the usage of the double precision operations. However the results were acceptable also in these cases considering that speed-ups near to 3x were obtained also with compute capability 1.1 and 1.2.\\
Even thought 5x is a significant speed-up, it is not enough to justify a commercial CUDA implementation of Keccak , especially supposing a multi-threading CPU version. The reasons behind this assertion are many, but the following is probably the most important: Keccak algorithm requires an amount of memory that is unacceptable for an efficient CUDA implementation.\\
More in details, each thread of the proposed algorithm requires the declaration of at least 480 bytes of local variables. Such an amount of memory cannot be stored into the registers of the MPs and for this reason the nvcc compiler automatically stores the local variables of each thread into the Local Memory. \footnote{Local memory is a memory abstraction that implies "local in the scope of each thread". It is not an actual hardware component of the multi-processor. In actuality, local memory resides in global memory allocated by the compiler and delivers the same performance as any other global memory region. Normally, automatic variables declared in a kernel reside in registers, which provide very fast access. Unfortunately, the relationship between automatic variables and local memory continues to be a source of confusion for CUDA programmers. The compiler might choose to place automatic variables in local memory when:
\begin{itemize}
\item There are too many register variables.
\item The compiler cannot determine if an array is indexed with constant quantities. Please note that registers are not addressable so an array has to go into local memory -- even if it is a two-element array -- when the addressing of the array is not known at compile time.
\item \textbf{A structure would consume too much register space}.
\end{itemize}}
The shared kernel was an unsuccessful attempt to reduce the impact of this problem on performance. The reason behind the failure of this solution is that also the amount of shared memory per block is limited. For this reason the number of threads per block have been reduced in order to fit the availability of shared memory. The result was a very low occupancy profile with corresponding impact on performance due to the under usage of the device resources.

\section{Future Developments Suggestions}
There are several ways in which this work can be extended and improved, event though these considerations are out of the scope:
\begin{description}
\item [CuKeccak] could be designed as a brand new algorithm, well suited for parallelism, implementing the same hash-function of Keccak.
\item [OpenCL] is hardware independent and for this reason can overcome the registers problem if the algorithm will be executed on a multi-core CPU. 
\end{description}
